HW: Label the entire notebook with comments describing what operation each code chunk performs and what its outcome is
Reading the penguins dataset
library(tidyverse)
library(plotly)
Penguins<-read.csv("penguins_size.csv")
We see that the data has 344 rows and 7 columns
Penguins
dim(Penguins)
[1] 344 7
Looking at summary stats
summary(Penguins)
species island culmen_length_mm culmen_depth_mm flipper_length_mm
Length:344 Length:344 Min. :32.10 Min. :13.10 Min. :172.0
Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
Mode :character Mode :character Median :44.45 Median :17.30 Median :197.0
Mean :43.92 Mean :17.15 Mean :200.9
3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
Max. :59.60 Max. :21.50 Max. :231.0
NA's :2 NA's :2 NA's :2
body_mass_g sex
Min. :2700 Length:344
1st Qu.:3550 Class :character
Median :4050 Mode :character
Mean :4202
3rd Qu.:4750
Max. :6300
NA's :2
We notice that species, island and sex are read in as character columns. We will convert them to factors.
Penguins$species<-as.factor(Penguins$species)
Penguins$island<-as.factor(Penguins$island)
Penguins$sex<-as.factor(Penguins$sex)
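The three conversions above can also be written as a single step over a vector of column names. The sketch below is only an illustration: the `toy` data frame and its values are made up and stand in for the real penguins data, and `cat_cols` is a name introduced here.

```r
# Toy stand-in for the penguins data; the values are made up for illustration.
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  island  = c("Torgersen", "Biscoe", "Dream", "Dream"),
  sex     = c("MALE", "FEMALE", ".", NA),
  stringsAsFactors = FALSE
)

# Names of the categorical columns (hypothetical helper variable).
cat_cols <- c("species", "island", "sex")

# Convert every listed column to a factor in one step.
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Each column is now a factor, so summary() reports counts per level.
sapply(toy, is.factor)
levels(toy$sex)
summary(toy)
```

Note that the stray "." entry becomes a factor level of its own, which is how such a value shows up in the summary output.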
Looking at summary again
summary(Penguins)
species island culmen_length_mm culmen_depth_mm flipper_length_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10 Min. :172.0
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30 Median :197.0
Mean :43.92 Mean :17.15 Mean :200.9
3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
Max. :59.60 Max. :21.50 Max. :231.0
NA's :2 NA's :2 NA's :2
body_mass_g sex
Min. :2700 . : 1
1st Qu.:3550 FEMALE:165
Median :4050 MALE :168
Mean :4202 NA's : 10
3rd Qu.:4750
Max. :6300
NA's :2
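The NA counts and the stray "." level seen in the summary can also be counted directly. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions, not the real data), using only base R.

```r
# Toy data with one missing mass, one missing sex and one "." entry
# (made-up values for illustration).
toy <- data.frame(
  body_mass_g = c(3750, NA, 4200, 5000, 3900),
  sex = factor(c("MALE", NA, ".", "FEMALE", "FEMALE"))
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Counts per sex level, including NA as its own category.
sex_counts <- table(toy$sex, useNA = "ifany")
sex_counts
```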
We will remove the rows with missing (NA) values and also the row with an erroneous value for sex (the single "." entry)
# Drop every row that contains at least one NA
Clean_DF<-na.omit(Penguins)
# Keep only the rows whose sex is a valid category
Clean_DF<-Clean_DF%>%
  filter(sex == 'FEMALE' | sex == 'MALE')
This leaves us with the below summary stats:
summary(Clean_DF)
species island culmen_length_mm culmen_depth_mm flipper_length_mm
Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10 Min. :172
Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60 1st Qu.:190
Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30 Median :197
Mean :43.99 Mean :17.16 Mean :201
3rd Qu.:48.60 3rd Qu.:18.70 3rd Qu.:213
Max. :59.60 Max. :21.50 Max. :231
body_mass_g sex
Min. :2700 Length:333
1st Qu.:3550 Class :character
Median :4050 Mode :character
Mean :4207
3rd Qu.:4775
Max. :6300
# Make sex a factor and drop levels that no longer occur (such as ".")
Clean_DF$sex<-droplevels(as.factor(Clean_DF$sex))
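The cleaning steps can be checked on a small made-up example. The sketch below uses base R subsetting instead of `filter()` so that it runs without extra packages; `toy`, `clean` and the values are assumptions for illustration. It also shows why unused factor levels may need to be dropped after filtering.

```r
# Toy data: row 2 has a missing mass, row 3 has sex ".", row 5 has a missing sex
# (made-up values for illustration).
toy <- data.frame(
  body_mass_g = c(3750, NA, 4200, 5000, 3900),
  sex = factor(c("MALE", "FEMALE", ".", "FEMALE", NA))
)

# Remove rows containing any NA (rows 2 and 5).
clean <- na.omit(toy)

# Keep only rows with a valid sex (removes row 3).
clean <- clean[clean$sex %in% c("FEMALE", "MALE"), ]

# The "." level is still attached to the factor even though no row uses it.
levels(clean$sex)

# Drop the unused level so summaries list only FEMALE and MALE.
clean$sex <- droplevels(clean$sex)
levels(clean$sex)
nrow(clean)
```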
We observe that there are 3 species: 146 Adelie, 68 Chinstrap and 119 Gentoo. These penguins are spread across 3 islands named Biscoe, Dream and Torgersen.
The culmen length of the penguins ranges from 32.10 mm to 59.60 mm, with an average of 43.99 mm and a median of 44.50 mm.
The culmen depth of the penguins ranges from 13.10 mm to 21.50 mm, with an average of 17.16 mm and a median of 17.30 mm.
The flipper length of the penguins ranges from 172 mm to 231 mm, with an average of 201 mm and a median of 197 mm.
The body mass of the penguins ranges from 2700 g to 6300 g, with an average of 4207 g and a median of 4050 g.
There are also 165 female and 168 male penguins.
summary(Clean_DF)
species island culmen_length_mm culmen_depth_mm flipper_length_mm
Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10 Min. :172
Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60 1st Qu.:190
Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30 Median :197
Mean :43.99 Mean :17.16 Mean :201
3rd Qu.:48.60 3rd Qu.:18.70 3rd Qu.:213
Max. :59.60 Max. :21.50 Max. :231
body_mass_g sex
Min. :2700 FEMALE:165
1st Qu.:3550 MALE :168
Median :4050
Mean :4207
3rd Qu.:4775
Max. :6300
We now graphically represent the same summary statistics in a boxplot, which also lets us check for outliers. There are no outliers in culmen length, culmen depth or flipper length.
boxplot(Clean_DF[3:5])
There are no outliers in body mass either
boxplot(Clean_DF$body_mass_g)
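The outlier check can also be done numerically rather than by eye. Base R's `boxplot()` flags points that lie more than 1.5 times the box height beyond the hinges, and `boxplot.stats()` returns those points in its `out` element. The sketch below uses made-up vectors (`x` and `x2` are assumptions for illustration), not the real body mass column.

```r
# Made-up values for illustration.
x <- c(2700, 3550, 4050, 4200, 4750, 6300)

# No value lies beyond 1.5 times the hinge spread, so 'out' is empty.
out_x <- boxplot.stats(x)$out
out_x

# Adding one extreme value produces a flagged outlier.
x2 <- c(x, 9000)
out_x2 <- boxplot.stats(x2)$out
out_x2
```

Applied to a real column, an empty `out` vector would confirm the "no outliers" reading of the plot.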
We now look at the distribution of the body mass based on sex. There are no outliers; however, the distributions differ between the sexes. The minimum weight for females is below 3000 g, while that for males is above 3000 g. The maximum weight for females is slightly above 5000 g, while it is above 6000 g for males. The median for females is between 3000 g and 4000 g, while that for males is between 4000 g and 5000 g.
p<-ggplot(Clean_DF, aes(sex, body_mass_g, fill=sex))+
geom_boxplot()
ggplotly(p)
We now look at the distribution of the culmen length based on sex. There are no outliers; however, the distributions differ between the sexes. The minimum for females is slightly below the minimum for males, and both lie between 30 mm and 40 mm. The maxima for females and males are both close to 60 mm, but the maximum for females is slightly lower. The medians for females and males are both close to 45 mm, but the median for females is slightly lower.
p<-ggplot(Clean_DF, aes(sex, culmen_length_mm, fill=sex))+
geom_boxplot()
ggplotly(p)
We now look at the distribution of the culmen depth based on sex. There are no outliers; however, the distributions differ between the sexes. The minimum for females is slightly below the minimum for males, and both lie between 12.5 mm and 15 mm. The maxima for females and males are both greater than 20 mm, but the maximum for females is slightly lower. The median for females is below 17.5 mm and that for males is above 17.5 mm.
p<-ggplot(Clean_DF, aes(sex, culmen_depth_mm, fill=sex))+
geom_boxplot()
ggplotly(p)
We now look at the distribution of the flipper length based on sex. There are no outliers; however, the distributions differ between the sexes. The minimum for females is slightly below the minimum for males, and both lie between 170 mm and 180 mm. The maxima for females and males are greater than 220 mm and 230 mm respectively. The median for females is above 190 mm and that for males is above 200 mm.
p<-ggplot(Clean_DF, aes(sex, flipper_length_mm, fill=sex))+
geom_boxplot()
ggplotly(p)
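The per-sex minimum, median and maximum read off the boxplots above can also be computed directly. This is a minimal base R sketch on made-up values; `toy` and `stats_by_sex` are names introduced here for illustration.

```r
# Made-up body masses for illustration.
toy <- data.frame(
  sex = factor(c("FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE")),
  body_mass_g = c(3400, 3600, 3800, 4300, 4500, 4700)
)

# Split the measurement by sex and summarise each group.
# The result is a matrix with one column per sex.
stats_by_sex <- sapply(
  split(toy$body_mass_g, toy$sex),
  function(x) c(min = min(x), median = median(x), max = max(x))
)
stats_by_sex
```

Replacing `body_mass_g` with another measurement column would give the same summary for that variable.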
Looking at body measures by Species
p<-ggplot(Clean_DF, aes(species, body_mass_g, fill=species))+
geom_boxplot()
ggplotly(p)
p<-ggplot(Clean_DF, aes(species, culmen_length_mm, fill=species))+
geom_boxplot()
ggplotly(p)
p<-ggplot(Clean_DF, aes(species, culmen_depth_mm, fill=species))+
geom_boxplot()
ggplotly(p)
p<-ggplot(Clean_DF, aes(species, flipper_length_mm, fill=species))+
geom_boxplot()
ggplotly(p)
p<-ggplot(data = Clean_DF) +
geom_bar(mapping = aes(x = species, fill=species))
ggplotly(p)
We see here that the Adelie species is spread across all three islands, whereas Chinstrap is exclusive to Dream island and Gentoo is exclusive to Biscoe.
p<-ggplot(data = Clean_DF) +
geom_bar(mapping = aes(x = island, fill=species))
ggplotly(p)
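The species-by-island pattern in the stacked bar chart corresponds to a cross-tabulation. The sketch below builds one with `table()` on made-up rows (`toy` is an assumption for illustration) and counts on how many islands each species occurs.

```r
# Made-up observations for illustration.
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"),
  island  = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe", "Biscoe")
)

# Rows are species, columns are islands, cells are counts.
counts <- table(toy$species, toy$island)
counts

# Number of islands on which each species appears at least once.
islands_per_species <- rowSums(counts > 0)
islands_per_species
```

A species with a single non-zero cell in its row is exclusive to that island.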
We observe from the scatterplot that the different species are almost separable based on the length and depth of the culmen. The red points represent Adelie, the green points Chinstrap and the blue points Gentoo. Within each species, there is also a positive correlation between the two variables.
p<-ggplot(data = Clean_DF) +
geom_point(mapping = aes(x = culmen_length_mm, y = culmen_depth_mm,color = species, shape=island))
ggplotly(p)
We observe from the scatterplot that the different species are almost separable based on the length of the culmen and the flipper. The red points represent Adelie, the green points Chinstrap and the blue points Gentoo.
The flipper length is also highly correlated with the culmen length.
p<-ggplot(data = Clean_DF) +
geom_point(mapping = aes(x = culmen_length_mm, y = flipper_length_mm,color = species, shape=island))
ggplotly(p)
p<-ggplot(data = Clean_DF) +
geom_point(mapping = aes(x = culmen_length_mm, y = body_mass_g, color = species, shape=island))
ggplotly(p)
We can observe the actual magnitude of the correlations from the plot below. Culmen length and culmen depth aren’t highly correlated overall, but as seen above they are positively correlated within each species.
Culmen length is highly correlated with flipper length and with body mass, and both correlations are positive.
Culmen depth is highly correlated with flipper length, but the correlation is negative.
Flipper length and body mass show a high positive correlation.
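library(corrplot)
# Correlation matrix of the four numeric measurements (columns 3 to 6)
M<-cor(Clean_DF[3:6])
# Colour-coded matrix with the coefficient printed in each cell
corrplot(M,method="color",addCoef.col = "white")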
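The difference between the overall correlation and the correlation within each species can be illustrated with a small sketch. The data below are made up so that two hypothetical groups, "A" and "B", each show a positive relationship while the pooled data show a negative one; none of the values come from the penguins dataset.

```r
# Made-up measurements for two hypothetical groups.
toy <- data.frame(
  species = rep(c("A", "B"), each = 4),
  culmen_length_mm = c(36, 38, 40, 42, 46, 48, 50, 52),
  culmen_depth_mm  = c(17, 18, 18.5, 20, 13, 14.5, 15, 16)
)

# Correlation over all rows, ignoring the grouping.
overall <- cor(toy$culmen_length_mm, toy$culmen_depth_mm)
overall

# Correlation computed separately inside each group.
within <- sapply(
  split(toy, toy$species),
  function(d) cor(d$culmen_length_mm, d$culmen_depth_mm)
)
within
```

Here the pooled coefficient is negative while both within-group coefficients are positive, which is why a single correlation matrix over all species can hide the pattern visible in a scatterplot coloured by species.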